Demo script for PathAnalyser R package which provides functionality for assessing ER and HER2 pathway activity in breast cancer transcriptomic datasets.
# If not already installed
install.packages("BiocManager")
BiocManager::install(c("GSVA", "pROC", "edgeR", "reshape2", "ggplot2","limma",
"VennDiagram", "NCmisc", "futile.logger"),
dependencies = TRUE)
install.packages("devtools")
devtools::install_github("a-thind/PathAnalyser")
Alternatively, you can clone the GitHub repository in a Linux / MacOS terminal:
git clone git@github.com:a-thind/PathAnalyser.git
Then type the following in R to install a local source version of the package:
library(utils)
install.packages("~/PathAnalyser", repo=NULL, type="source")
Once installation is completed, load the library to start using PathAnalyser:
library(PathAnalyser)
The read_expression_data function reads gene expression data from
a file with an delimiter, the function checks the delimiter of the file and reads the file accordingly. It returns an output with gene symbols/IDs as the row names and columns representing each sample IDs in the dataset.
ER_expression_set <- read_expression_data("../raw_data/toy_data.txt")
# ER_expression_set<-ER_expression_set[,1:200]
# print(ER_expression_set[1:5, 1:5])
ER_expression_set contains RNASeq raw read counts for 20 primary breast cancer samples, 10 of which have ER pathway activity (ER+) and 10 which have inactive ER pathway activity (ER-).
dim(ER_expression_set)
#> [1] 20124 20
# Expression data for first 6 genes
head(ER_expression_set)
#> TCGA.AR.A0TY TCGA.AO.A124 TCGA.BH.A0BG TCGA.AN.A0G0 TCGA.AO.A0J8
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 16841 6974 8211 5586 12210
#> TMEM107 189 851 581 433 1298
#> CAMK2A 10 20 18 27 19
#> TRAM1L1 233 198 426 328 1044
#> TCGA.D8.A1XZ TCGA.AO.A0JE TCGA.E9.A5FL TCGA.LL.A8F5 TCGA.C8.A12V
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 8627 9138 3219 7663 6867
#> TMEM107 629 253 226 1265 385
#> CAMK2A 30 47 72 45 18
#> TRAM1L1 385 467 153 342 33
#> TCGA.A2.A0CM TCGA.AC.A3TN TCGA.AC.A3QQ TCGA.A2.A0YF TCGA.E2.A15F
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 4416 2869 718 14649 7106
#> TMEM107 332 660 230 1738 319
#> CAMK2A 29 34 25 4 48
#> TRAM1L1 212 482 28 428 507
#> TCGA.AO.A1KQ TCGA.BH.A0E0 TCGA.BH.A18G TCGA.5L.AAT0 TCGA.A8.A07C
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 6351 3150 5278 4839 9226
#> TMEM107 633 234 245 284 565
#> CAMK2A 32 34 23 43 9
#> TRAM1L1 169 71 220 272 615
Gene signature files can be read by the read_signature_data function. It returns a dataframe with signature IDs/symbols in the first column and their expression pattern is represented as up r regulated or down regulated by +1 and -1 respectively.
er_sign_df <- read_signature_data("../raw_data/ESR1_UP.v1._UP.grp", "../raw_data/ESR1_DN.v1_DN.grp")
dim(er_sign_df)
#> [1] 160 2
head(er_sign_df)
#> genes expression
#> 1 ABAT 1
#> 2 ACADSB 1
#> 3 ADCY1 1
#> 4 ADCY9 1
#> 5 AGR2 1
#> 6 ANXA9 1
# up-regulated genes are given an expression value of 1
er_up_sign <- er_sign_df[er_sign_df$expression == 1 ,]
dim(er_up_sign)
#> [1] 101 2
# Down-regulated genes are given an expression value of -1
er_dn_sign <- er_sign_df[er_sign_df$expression == -1 ,]
dim(er_dn_sign)
#> [1] 59 2
The log_cpm_transformation function transforms the raw data by the method of log CPM and returns the transformed data. It also performs sanity check of the transformation by returning box plots for visualization of distribution of data before and after transformation.
# using toy data set as expression matrix
data_se <- HER2_data_se1
normalized_se <- log_cpm_transformation(ER_expression_set)
The gene names are first checked for inclusion in the gene signature data frame. Genes that are included in the gene signature are retained then subjected to further filtration, in which their expression across the data set measured. If a gene is not present in at least 10% of the total number of samples in the data set, the gene is dropped from the expression matrix. A console message is printed reporting the number of features (genes) retained in the final normalized expression matrix.
normalized_se <- check_signature_vs_dataset(normalized_se, er_sign_df)
#> Number of feature present in expression dataset: 147
gsva_scores_dist method visualizes the GSVA score distribution for the abundance of expression of the up-regulated and down-regulated gene sets of the gene signature for each sample.
gsva_scores_dist(er_sign_df, normalized_se)
# Add thresholds on plot
library(ggplot2)
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.2, 0.2))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
#> NULL
Shrinking the distance between the high and low expression thresholds would result in fewer “Uncertain” pathway activity classifications
# more relaxed thresholds, fewer uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"),
vline=c(-0.1, 0.1))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
#> NULL
Conversely, expanding the distance between the thresholds would increase the frequency of “Uncertain” pathway activity classifications, as the number of samples meeting the thresholds for consistency and inconsistency would be reduced.
# more stringent thresholds, greater uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.4, 0.4))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
#> NULL
classes_df <- classify_GSVA_abs(er_sign_df, normalized_se,
up_thresh.low=-0.2, up_thresh.high=0.2,
dn_thresh.low=-0.2, dn_thresh.high=0.2)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#>
#> Active Inactive Uncertain
#> 9 10 1
#>
#> Total number of samples: 20
#> Total number of samples classified: 19
# default percentile = 25% (quartile)
classes_df.percent <- classify_GSVA_percent(er_sign_df, normalized_se)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#>
#> Active Inactive Uncertain
#> 4 5 11
#>
#> Total number of samples: 20
#> Total number of samples classified: 9
# custom percentile = 30%
classes_df.percent <- classify_GSVA_percent(er_sign_df, normalized_se,
percent_thresh=30)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#>
#> Active Inactive Uncertain
#> 6 6 8
#>
#> Total number of samples: 20
#> Total number of samples classified: 12
confusion_matrix <- calculate_accuracy("../raw_data/Sample_labels.txt",
classes_df.percent, "ER", display_statistics=TRUE,
display_roc_curve=TRUE)
#> [1] "Confusion Matrix for ER pathway"
#> [1] "--------------------------------------------------------------"
#> Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive 6 0 0
#> Prediction Negative 0 6 0
#> Prediction Uncertain 4 4 0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 60.00"
#> [1] "Accuracy amongst classified samples: 100.00"
#> [1] "True Positive(TP): 6"
#> [1] "True Negative(TN): 6"
#> [1] "False Negative(FN): 0"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 100.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 0.00"
#> [1] "--------------------------------------------------------------"
#> Actual Positive Actual Negative Actual Uncertain
#> Min. :0.000 Min. :0.000 Min. :0
#> 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:0
#> Median :4.000 Median :4.000 Median :0
#> Mean :3.333 Mean :3.333 Mean :0
#> 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:0
#> Max. :6.000 Max. :6.000 Max. :0
classes_pca(normalized_se, classes_df.percent, pathway_name = "ER")
The output of sessionInfo upon which this file was generated:
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] plotly_4.10.0 pROC_1.18.0 GSVA_1.40.1
#> [4] reshape2_1.4.4 ggplot2_3.3.5 edgeR_3.34.1
#> [7] limma_3.48.3 reader_1.0.6 NCmisc_1.1.6
#> [10] PathAnalyser_0.0.0.9000
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-7 matrixStats_0.61.0
#> [3] bit64_4.0.5 RColorBrewer_1.1-3
#> [5] httr_1.4.2 GenomeInfoDb_1.28.4
#> [7] tools_4.1.1 bslib_0.3.1
#> [9] irlba_2.3.5 utf8_1.2.2
#> [11] R6_2.5.1 HDF5Array_1.20.0
#> [13] lazyeval_0.2.2 DBI_1.1.2
#> [15] BiocGenerics_0.38.0 colorspace_2.0-3
#> [17] rhdf5filters_1.4.0 withr_2.5.0
#> [19] tidyselect_1.1.2 bit_4.0.4
#> [21] compiler_4.1.1 graph_1.70.0
#> [23] cli_3.2.0 Biobase_2.52.0
#> [25] DelayedArray_0.18.0 labeling_0.4.2
#> [27] sass_0.4.1 scales_1.1.1
#> [29] stringr_1.4.0 digest_0.6.29
#> [31] rmarkdown_2.13 XVector_0.32.0
#> [33] pkgconfig_2.0.3 htmltools_0.5.2
#> [35] sparseMatrixStats_1.4.2 MatrixGenerics_1.4.3
#> [37] fastmap_1.1.0 highr_0.9
#> [39] htmlwidgets_1.5.4 rlang_1.0.2
#> [41] rstudioapi_0.13 RSQLite_2.2.12
#> [43] DelayedMatrixStats_1.14.3 jquerylib_0.1.4
#> [45] farver_2.1.0 generics_0.1.2
#> [47] jsonlite_1.8.0 crosstalk_1.2.0
#> [49] BiocParallel_1.26.2 dplyr_1.0.8
#> [51] BiocSingular_1.8.1 RCurl_1.98-1.6
#> [53] magrittr_2.0.3 GenomeInfoDbData_1.2.6
#> [55] Matrix_1.4-1 Rhdf5lib_1.14.2
#> [57] Rcpp_1.0.8.3 munsell_0.5.0
#> [59] S4Vectors_0.30.2 fansi_1.0.3
#> [61] lifecycle_1.0.1 stringi_1.7.6
#> [63] yaml_2.3.5 SummarizedExperiment_1.22.0
#> [65] zlibbioc_1.38.0 rhdf5_2.36.0
#> [67] plyr_1.8.7 grid_4.1.1
#> [69] blob_1.2.2 parallel_4.1.1
#> [71] crayon_1.5.1 lattice_0.20-45
#> [73] beachmat_2.8.1 Biostrings_2.60.2
#> [75] annotate_1.70.0 KEGGREST_1.32.0
#> [77] locfit_1.5-9.5 knitr_1.38
#> [79] pillar_1.7.0 GenomicRanges_1.44.0
#> [81] ScaledMatrix_1.0.0 stats4_4.1.1
#> [83] XML_3.99-0.9 glue_1.6.2
#> [85] evaluate_0.15 data.table_1.14.2
#> [87] png_0.1-7 vctrs_0.4.0
#> [89] tidyr_1.2.0 gtable_0.3.0
#> [91] purrr_0.3.4 assertthat_0.2.1
#> [93] cachem_1.0.6 xfun_0.30
#> [95] rsvd_1.0.5 xtable_1.8-4
#> [97] viridisLite_0.4.0 SingleCellExperiment_1.14.1
#> [99] tibble_3.1.6 proftools_0.99-3
#> [101] AnnotationDbi_1.54.1 memoise_2.0.1
#> [103] IRanges_2.26.0 ellipsis_0.3.2
#> [105] GSEABase_1.54.0